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Abstract. Measurements of optical turbulence time series data using unattended instruments 
over long time intervals inevitably lead to data drop-outs or degraded signals. We present 
a comparison of methods using both Principal Component Analysis, which is also known 
as the Karhunen-Loeve decomposition, and ARIMA that seek to correct for these event- 
induced and mechanically-induced signal drop-outs and degradations. We report on the 
quality of the correction by examining the Intrinsic Mode Functions generated by Empirical 
Mode Decomposition. The data studied are optical turbulence parameter time series from a 
commercial long path length optical anemometer/scintillometer, measured over several hundred 
metres in outdoor environments. 



1. Introduction 

How many data points can be missed off (set to zero) from a 1-D discrete, regularly spaced time 
series record and still be recoverable? Such a question is prompted by many situations, often 
associated to the automated collection of data. We are motivated to address this question by our 
experimental field work to record the behaviour of the strength of optical turbulence parameter, 
C 2 , over time intervals of several weeks, under different climate conditions [H El El [5] . 

The data of interest to us are path integrated measures of C 2 measured across the visible to 
near infrared region. The measurement path is a 600 m stretch of free space, approximately 1.5 
m altitude above sea water in the Caribbean. The probe beam originated from an LED centered 
on 0.9 /jm; the instrumentation are two identical Optical Scientific Inc. LOA-Oo43 systems, 
which employ aperture averaging to estimate the value of C 2 . The LOA-004s are generally 
left unattended over the course of up to a few days during field operation, which mean that 
they are susceptible to spurious events which result in gaps in the data record. Although for 
some types of data analysis techniques, data gaps of a limited size can be tolerated, this is not 
universal. Moreover, there exists a new class of very powerful techniques based on Empirical 
Mode Decomposition [61 2] that are exceedingly sensitive to lossy datasets. It is for this reason 
we are exploring different methodologies to synthetically fill the data holes; we describe the 
results from our studies using Principal Component Analysis in this paper. 
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2. Karhunen Loeve or Principal Component Analysis 

The principal components of any ensemble can be used to identify the members of that ensemble. 
This idea forms the foundation of face recognition and tracking through eigenfaces (see, for 
example, Turk and Pentland 1991 [8]). We may extend this method to reconstruct the missing 
data for any data record under given restrictions. The key point is that the gappy data record 
must have the same, or similar, salient features as all the members of the ensemble. 

The principal components are the eigenvectors of the covariance matrix of the data and 
represent the features of the dataset. Provided that a reference library can be created, each 
member of the library will contribute to each eigenvector, more or less. As such, each member 
can be exactly represented by a linear combination of eigenvectors. Any similar data record 
external to the library will also be represented by a linear combination of eigenvectors, within a 
margin of error. 

The first step for the filling procedure must therefore be to define the reference library. 
We may do so by collecting a family of turbulence data series that share certain specific 
characteristics; a difficult task since the definition of such is an open question. Moreover, 
since the mean value of the family of reference data plays a key part, all the members of the 
library would require normalisation. Again how to do this is an open question. Alternatively we 
may use the neighbouring record around a data hole, sectioning this information to provide the 
ensemble members. We opt for the latter technique since the record pre and post the data hole 
(within a certain time interval) ought to be similar in nature to the missing data. The mean 
value of this type of library would probably not differ greatly from the mean of the missing data, 
so normalisation would not be so crucial. 

How does one determine the principal components of a reference library? Following Sirovich 
and Kir by [9], let the M members (each of length N) of the reference ensemble be {p n }- Thus, 
the average data record of this ensemble will be 

1 M 

ip =< (p >= _ <Pn (1) 
n=l 

It is very reasonable to assume that departures from the mean record will provide an efficient 
procedure for extracting the primary features of the data. Therefore, we define 

<t>n = <Pn-~<P (2) 



Now, if we consider the dyadic matrix 

M 

C = ]T Ml = AA T (3) 

71=1 

where each term of the sum signifies a second rank tensor product, we can recognize this as the 
ensemble average of the two point correlation of the deviations from the mean. Here, A T is the 
transpose of A. 

We require eigenvectors u n of the matrix AA T . For ensembles whose members have a large 
number of points N > M, matrix AA T issingular and its order cannot exceed M. To find those 
eigenvectors of AA T corresponding to nonzero eigen values, Turk and Pentland used a standard 
singular value decomposition technique, as described below. 

A T Av n = fiv n (4) 
AA T Av n = fJ, n Av n 
CAv n = \xAv n 



where fj, n are the eigenvalues. This deduction can be equated to 

Cu n = fi n u n (5) 

where u n = Av n . Thus AA T and A T A have the same eigenvalues and their eigenvectors are 
related through u n = Av n , provided that \\u n \ \ = 1. The treatment described is recognizable as 
the Karhunen-Loeve (KL) method |10j . 

The implication is that a dataset (j)' can be obtained from a limited summation 

M 

(j)' sa ^2 a n u n (6) 

n=l 

where the coefficients a n are obtained through the inner product 

a n = (4>',u n ) (7) 
We emphasise that cp' is not considered to be part of the {4>n}, although it is similar in features. 



3. Proof of Concept 

To demonstrate the validity of the assertion of the previous section, we take a perfect data 
record of measurements over a 7 hour period starting from midnight, smoothed by a forward 
moving rolling average of 5 minutes' interval (60 data points). The data contain 2492 points in 
total. We split up the test data into 21 sections, where all but 1 are members of the reference 
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Figure 1. The test data, split into 21 sections. The data are padded before division with a set 
of points taken from the tail of the time series and mirrored outward. 



library, as shown in Fig. [TJ The exclusive section is set to zero, and the algorithm described 
in Sec. [5] is employed on the 20 library elements. The reconstructions are created from the KL 
coeffiecients of the eigenvectors equivalent to the sections adjacent to the missing data. Hence 
we will talk of a prior and posterior reconstruction meaning e.g. for omitted section 5, we use for 
the prior the KL coefficient equivalent to section 4 and for the posterior reconstruction we use 
the coefficient equivalent to section 6 of the test data set. Note that this does not reconstruct 
those sections, since the eigenvectors are generated from the entire reference library. 



The reconstructions shown in the left hand set of Fig. [2] represent the best level of error, 
while the right hand set shows the worst. It is clear that the greater the difference between 
the (masked off) original data section and its neighbours, the poorer the reconstruction will be. 
Nevertheless, the reconstruction errors for the full set of test data are all 1 order of magnitude 
less than the mean value of the reconstruction and the original data segment. Evidently the end 
points have not been synthesised to be continuous with the adjacent segments of the reference 
library signal, as can be seen from the error. A continuity condition in terms of the both the 
function and its derivative has to be imposed on both ends of the reconstructed segment in 
order to achieve smoothness. The most effective way to do so is currently being investigated. 
The eigenvalues determined through this method show a similar distribution in both cases, 
with minor variations only at the upper end of the spectrum, implying that the KL differences 
between best and worst section for reconstruction is not very large. 

3.1. KL and EMD 

In the absence of a workable continuity condition, we present here the effects of crudely patching 
the data hole with reconstructions determined from the prior and posterior terms with respect 
to the gap. 

We apply the Empirical Mode Decomposition (EMD) algorithm [U| to the reconstructions. 
EMD is a novel adaptive method for separating a nonlinear time series into components based 
on instantaneous frequency. Basically it acts as a dyadic filter bank [7]; the set for the best 
reconstruction case are illustrated in Figs. [3j We refer to these sets as EMDl and EM Dr. 
The original data's intrinsic mode functions (IMFs) and residuals (set EM Do) are also shown 
and for comparison, we present the effect of a simple minded linear interpolation between the 
endpoints of the known data on the IMFs. 

Upon visual inspection, we see that the original data generates 9 components: 8 intrinsic 
modes and 1 residual (the stopping criterion for our EMD implementation is the same as in 
Huang et al [B]). On the other hand, the interpolated data have only 8 components. Numbering 
the IMFs from highest instantaneous frequency to lowest, starting from IMF 1, we can see by 
inspection that both EMDl an d EMDr are strongly similar to EM Do in IMFs 4,5,6 and 7. 
The differences appear in the higher frequency components, due to the discontinuity between 
the inserted segment and the unadulterated data. The endpoint discontinuities evidently modify 
the variances of IMFs 1,2 and 3, although it seems that they are only affected in the area local 
to the discontinuity, per IMF. 

By way of comparison, a linear interpolant between the edges of the known data show that 
there is contamination all through the IMFs. It is so strong that IMF 7, which in the other sets 
clearly distinguishes the baseline rise and fall of the turbulence over the interval under study, is 
unable to pick out a clean pedestal. 

4. ARIMA 

Instead of decomposing the data sequence into KL components, an alternative method studied is 
that of ARIMA (Auto-Regressive Integrated Moving Average) [T5j . This technique's principle is 
founded on taking advantage of the past (first moment) behaviour of stochastic data to predict 
the future characteristics. It is a well known technique used in econometrix, and we have 
employed it here to investigate its effects on EMD. 

An ARIMA(p, d, q) process is the solution of the following equation 

<t>(B)(l-B) d X t = iP(B)e t (8) 

where BXt = Xt-\ is the backshift operator, et represents a white noise process, and (f>, ip are 
polynomials of degree p and q respectively. For mathematical convenience we take p and q to 
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Figure 2. Best and worst reconstruction result using PCA for (a) section 5 and (b) section 
13 respectively. The layout in each set is : (Top Left) Segment prior to the selection. (Top 
Middle) The original selected data. (Top Right) Segment after selection. (Centre Left) The 
difference between the reconstruction using the coefficient of eigenvector related to the prior 
segment and the original. The value shown is the mean absolute difference per pixel. (Centre 
Right) The (prior) reconstruction. (Bottom Left) The difference between the reconstruction 
using the coefficient related to the posterior segment and the original. (Bottom Middle) The 
(posterior) reconstruction. (Bottom Right) The eigenvalue spectrum. 




Figure 3. (Top Left) The IMFs 1 to 8 (top to bottom) and the residual trend line of EM Do, 
generated by applying Empirical Mode Decomposition to the test data. (Top Right) The IMFs 
1 to 7 and the residue of EMDl, generated from data using the PC A reconstruction method 
with the coefficient of the prior section. (Bottom Left) The IMFs 1 to 7 and the residue of 
EMDr, generated from data using the PC A reconstruction method of the posterior section. 
(Bottom Right) The IMFs 1 to 7 (top to bottom) and the residue, generated from data with a 
linear interpolant across section 5 of the test data. 



be zero. Fractional d generalises the differencing parameter between data points to obtain the 
long range dependence. In general, < d < 0.5, where longer memory is represented by higher 
values of d. 

4.1. ARIMA and EMD 

We applied an ARIMA(0,<i,0) model to predict the behaviour of the final section of the split 
test data. The length of each section is 119 points long; by basing the data behaviour on the 
previous 2373 points and by setting the difference operator to be d = 0.5 we generated the IMF 
results shown in Fig. [H 




Figure 4. The IMFs and the residual trend line of EM Do, generated by applying Empirical 
Mode Decomposition to ARIMA(0,d = 0.5,0). 



As can be seen, the effect of the ARIMA(0, d, 0) extrapolation is to fill in the higher order 
IMFs in a smooth fashion (apparently, by visual inspection , IMFs 5 to 8 are smoothly filled at 
the tail). It is only untrustworthy when considering the higher frequency components, because 
ARIMA considers only the conditional first moment. Note that the number of IMFs and residual 
are the same as the unadulterated signal EMD<j- 

5. CONCLUSIONS 

We have discussed a data hole filling method for stochastic data, a necessary step to be able to use 
the new technique of Empirical Mode Decomposition upon a time series record. The Karhunem- 
Loeve eigenvectors from an ensemble of neighbouring sections of complete data around a data 
hole can be used to reconstruct the missing segment to a reasonable degree of accuracy, at 
least for the purposes of applying EMD. We have shown that the edge continuity is important, 
although the effect of discontinuities is not universal through all the intrinsic modes of the data. 
The quality of the reconstruction is much better using PCA than a simplistic linear interpolant 
(or merely ignoring the data gap). We compared this to a simplified ARIMA(0,d,0) model, 
which performed better than the linear interpolant but less effectively than the KL algorithm, 
disregarding edge effects. 

In summary, when a data hole is present, merely ignoring the data hole or filling it with a 
linear interpolant is a poor technique from the point of view of EMD. The result of doing so 
leads to leakage of spurious effects both laterally in time (per IMF) and longitudinally across 
all the IMFs. By reconstructing the datagap with ARIMA(0, d = 0.5, 0) one can limit the 
leakage laterally and longitudinally, although the lowest IMFs remain strongly contaminated 
with artificial structure. KL reconstruction limits the leakage best, and even without considering 
the continuity between the adjacent sections and the reconstructed datagap, it promises to 
provide the best reconstruction of the three methods described in this paper. 
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